ggml: optimize ggml_vec_dot_mxfp4_q8_0 dot product on ARM SVE #19171
jiangshhh wants to merge 1 commit into ggml-org:master
Conversation
@ggerganov @slaren The PR introduces an ARM SVE optimization for ggml_vec_dot_mxfp4_q8_0. This is my first PR to llama.cpp, so I would like to check whether there are any additional steps I should follow for the review. Thank you very much for your time and for maintaining this project.
@Alcpz By any chance do you have ARM SVE hardware to test and review this? :)
Unfortunately no, I would be happy to help otherwise.
I've just spun up an AWS Graviton4 (Neoverse V2) instance to compare master against this PR:

master
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 8 | pp512 | 49.33 ± 0.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 8 | tg128 | 29.12 ± 0.01 |
build: f9bd518 (7955)
pr/19171
$ build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 8 | pp512 | 47.07 ± 0.02 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 8 | tg128 | 28.82 ± 0.06 |
build: 18ad28c (7870)
gcc dump
$ echo | gcc -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme -dM -E - | grep __ARM_FEATURE_SVE
#define __ARM_FEATURE_SVE_BITS 0
#define __ARM_FEATURE_SVE_VECTOR_OPERATORS 1
#define __ARM_FEATURE_SVE2_AES 1
#define __ARM_FEATURE_SVE 1
#define __ARM_FEATURE_SVE2_SHA3 1
#define __ARM_FEATURE_SVE_MATMUL_INT8 1
#define __ARM_FEATURE_SVE_BF16 1
#define __ARM_FEATURE_SVE2 1
#define __ARM_FEATURE_SVE2_BITPERM 1
lscpu
$ lscpu | grep -E "Model name|Flags"
Model name: Neoverse-V2
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
@taronaeo Regarding the SVE optimization for mxfp4, we initially observed approximately a 2x performance improvement on FX700 (A64FX), while no significant speedup was observed on Graviton4 (Neoverse V2). This behavior is consistent with the underlying SIMD microarchitecture: as summarized below, there are some differences across the following architectures.
A64FX (FX700)
Here, SVE provides a clear width and throughput advantage over NEON.

Neoverse V2 (Graviton4/NVIDIA Grace)
In this case, the effective SIMD throughput of SVE and NEON is architecturally equivalent. Although SVE provides a more flexible programming model, the raw vector width and pipeline count are effectively the same as NEON.

Additional Measurement on NVIDIA Grace

After the latest refinement of the implementation, we re-measured performance on NVIDIA Grace (Neoverse V2) using llama-bench (8 threads, 512 prompt tokens, 128 generation tokens, 5 repetitions).
After (PR build 7957)
Summary

On A64FX (512-bit SVE), SVE has a clear hardware throughput advantage over NEON, so a roughly 2x speedup is observed. On Neoverse V2, the effective throughput of SVE and NEON is equivalent, so no large speedup is expected.

I hope this clarifies the architectural reason behind the observed performance differences. Thank you again for the valuable feedback.
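For intuition, here is a rough per-cycle SIMD-width comparison. The pipeline figures are my assumption based on commonly published microarchitecture descriptions (2 × 512-bit SVE pipelines on A64FX, where NEON executes at 128 bits, versus 4 × 128-bit SVE2/NEON pipelines on Neoverse V2); they are not numbers taken from this thread:

$$
\text{A64FX: } \frac{2 \times 512\ \text{bit (SVE)}}{2 \times 128\ \text{bit (NEON)}} = 4\times
\qquad
\text{Neoverse V2: } \frac{4 \times 128\ \text{bit (SVE2)}}{4 \times 128\ \text{bit (NEON)}} = 1\times
$$

Under these assumptions, A64FX has up to a 4x theoretical width advantage for SVE (roughly 2x of which shows up end to end in the measurements above), while Neoverse V2 has none, which matches the benchmark results.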
Proposal
This proposal introduces an ARM SVE-optimized implementation of ggml_vec_dot_mxfp4_q8_0 for the ggml/llama.cpp CPU backend.
The current implementation relies on scalar or NEON-based code paths, which do not fully utilize the wide vector capabilities available on modern ARM CPUs equipped with the Scalable Vector Extension (SVE). By leveraging SVE intrinsics, this proposal aims to make better use of the available vector width on such CPUs.
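For readers unfamiliar with the formats involved, the sketch below illustrates the semantics this kernel computes. The struct layouts, field names, helper functions, and the nibble-to-element mapping are assumptions made for this illustration, not the actual ggml definitions (ggml stores the Q8_0 scale as fp16 and accumulates via an integer lookup table; a plain float scale and a float FP4 table are used here for brevity):

```c
#include <math.h>
#include <stdint.h>

// Illustrative reference semantics only -- layouts and names are assumptions,
// not the ggml definitions. An MXFP4 block stores 32 values as a shared E8M0
// exponent byte plus 16 bytes of packed 4-bit FP4 (E2M1) codes; a Q8_0 block
// stores 32 int8 values with a per-block scale.
#define BLK 32

typedef struct { uint8_t e; uint8_t qs[BLK/2]; } blk_mxfp4; // hypothetical layout
typedef struct { float   d; int8_t  qs[BLK];   } blk_q8_0;  // hypothetical layout

// FP4 E2M1 code -> value (top bit of the 4-bit code is the sign)
static const float fp4_lut[16] = {
     0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f,
};

// E8M0 scale: an unsigned exponent byte encoding 2^(e - 127)
static float e8m0_to_fp32(uint8_t e) { return ldexpf(1.0f, (int) e - 127); }

// n must be a multiple of BLK; low nibbles are assumed to map to the first
// 16 elements of a block and high nibbles to the last 16.
static float vec_dot_mxfp4_q8_0_ref(int n, const blk_mxfp4 *x, const blk_q8_0 *y) {
    float sum = 0.0f;
    for (int ib = 0; ib < n / BLK; ++ib) {
        float acc = 0.0f;
        for (int j = 0; j < BLK/2; ++j) {
            const uint8_t q = x[ib].qs[j];
            acc += fp4_lut[q & 0x0F] * y[ib].qs[j];
            acc += fp4_lut[q >>   4] * y[ib].qs[j + BLK/2];
        }
        sum += acc * e8m0_to_fp32(x[ib].e) * y[ib].d;
    }
    return sum;
}
```

The key point for a vectorized port is that the two per-block scales (the E8M0 exponent and the Q8_0 scale) are applied once per block, outside the inner accumulation loop.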
Verifying Features
The proposed SVE implementation was verified with the following considerations:
- Accumulation logic and scaling factors follow the original ggml_vec_dot_mxfp4_q8_0 definition.
- The implementation uses SVE intrinsics only, without assuming a fixed vector length (see the sketch after this list).
- The SVE path is guarded by __ARM_FEATURE_SVE to ensure it is executed only on supported hardware.
- Non-SVE platforms continue to use the existing scalar or NEON implementations without modification.
- The change does not affect other quantization paths.
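To make the vector-length-agnostic and feature-guard points concrete, here is a minimal sketch of that pattern applied to a plain int8 dot product; it is an illustration only, not the actual mxfp4 kernel from this PR:

```c
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#include <stdint.h>

// Minimal vector-length-agnostic pattern: advance by svcntb() bytes per
// iteration and predicate the tail with svwhilelt, so the same code runs
// unchanged on 128-bit (Neoverse V2) and 512-bit (A64FX) SVE hardware.
static int64_t dot_s8_sve(const int8_t *x, const int8_t *y, int n) {
    svint32_t acc = svdup_n_s32(0);
    const int step = (int) svcntb();                 // bytes per SVE vector
    for (int i = 0; i < n; i += step) {
        const svbool_t pg = svwhilelt_b8_s32(i, n);  // active lanes for this chunk
        const svint8_t vx = svld1_s8(pg, x + i);     // inactive lanes read as zero
        const svint8_t vy = svld1_s8(pg, y + i);
        acc = svdot_s32(acc, vx, vy);                // 4-way int8 -> int32 dot product
    }
    return svaddv_s32(svptrue_b32(), acc);           // horizontal reduction
}
#endif // __ARM_FEATURE_SVE
```

Because nothing in the loop assumes a particular vector width, the same source covers both the A64FX and Neoverse V2 cases discussed above, and non-SVE builds never see this code thanks to the preprocessor guard.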
Performance check
The performance was measured on an FX700 (A64FX).
Performance improves as follows (values are tokens per second).
The command used to measure performance:
llama-batched --model ${PATH_TO_MODEL} --prompt 'AI is going to' --parallel 8 --predict 128 --seed 0 --threads 48